Abstract

Reciprocating interactions represent a central feature of all human exchanges. They have
been the target of various recent experiments, with healthy participants and psychiatric
populations engaging as dyads in multi-round exchanges such as a repeated trust task. Behaviour
in such exchanges involves complexities related to each agent's preference for equity with
their partner, beliefs about the partner's appetite for equity, beliefs about the partner's model
of their partner, and so on. Agents may also plan different numbers of steps into the future.
Providing a computationally precise account of the behaviour is an essential step towards
understanding what underlies choices. A natural framework for this is that of an interactive
partially observable Markov decision process (IPOMDP). However, the various complexities
make IPOMDPs inordinately computationally challenging. Here, we show how to approximate
the solution for the multi-round trust task using a variant of the Monte-Carlo tree search
algorithm. We demonstrate that the algorithm is efficient and effective, and therefore can be
used to invert observations of behavioural choices. We use generated behaviour to elucidate
the richness and sophistication of interactive inference.


1 Author Summary 


Agents interacting in games with multiple rounds must model their partners' thought processes
over extended time horizons. This poses a substantial computational challenge that has
restricted previous behavioural analyses. By taking advantage of recent advances in algorithms
for planning in the face of uncertainty, we demonstrate how these formal methods can be
extended. We use a well-studied social exchange game called the trust task to illustrate the power
of our method, showing how agents with particular cognitive and social characteristics can be
expected to interact, and how to infer the properties of individuals from observing their
behaviour.

2 Introduction


Successful social interactions require individuals to understand the consequences of their
actions on the future actions and beliefs of those around them. Mapping these processes is a
complex challenge in at least three different ways. The first is that other people's preferences or
utilities are not known exactly. Even if the various components of the utility functions are held
in common, the actual values of the parameters of partners, e.g., their degrees of envy or guilt,
could well differ. This ignorance decreases through experience, and can be
modeled using the framework of a partially observable Markov decision process (POMDP).
However, normal mechanisms for learning in POMDPs involve probing or running experiments,
which has the potential cost of partners fooling each other. The second complexity lies in
characterizing the form of the model agents have of others. In principle, agent
A's model of agent B should include agent B's model of agent A; and in turn, agent B's model
of agent A's model of agent B, and so forth. The beautiful theory of Nash equilibria, extended
to the case of incomplete information via so-called Bayes-Nash equilibria [8], dispenses
with this so-called cognitive hierarchy [9, 10, 11, 12], looking instead for an equilibrium
solution. However, a wealth of work (see for instance [13]) has shown that people deviate from
Nash behaviour. It has been proposed that people instead model others to a strictly limited, yet
non-negligible, degree [9, 10].

The final complexity arises when we consider that although it is common in experimental
economics to create one-shot interactions, many of the most interesting and richest aspects of
behaviour arise with multiple rounds of interactions. Here, for concreteness, we consider the
multi-round trust task, which is a social exchange game that has been used with hundreds of
pairs (dyads) of subjects, including both normal and clinical populations [14, 15, 16, 17, 18]. This
game has been used to show that characteristics that only arise in multi-round interactions such
as defection (agent A increases their cooperation between two rounds; agent B responds by
decreasing theirs) have observable neural consequences that can be measured using functional
magnetic resonance imaging (fMRI) [19, 14, 20, 21, 22].

The interactive POMDP (IPOMDP) [23] is a theoretical framework that formalizes many of these
complexities. It characterizes the uncertainties about the utility functions and planning over
multiple rounds in terms of a POMDP, and constructs an explicit cognitive hierarchy of models
about the other (hence the moniker 'interactive'). This framework has previously been used
with data from the multi-round trust task [24, 20]. However, solving IPOMDPs is
computationally extremely challenging, restricting those previous investigations to a rather minuscule
degree of forward planning (just two rounds out of what is actually a ten-round interaction). Our main
contribution is the adaptation of an efficient Monte Carlo tree search method, called partially
observable Monte Carlo planning (POMCP), to IPOMDP problems. Our second contribution is
to illustrate this algorithm through examination of the multi-round trust task. We show
characteristic patterns of behaviour to be expected for subjects with particular degrees of inequality
aversion, other-modeling and planning capacities, and consider how to invert observed
behaviour to make inferences about the nature of subjects' reasoning capacities.




3 Materials and Methods 


We first briefly review Markov decision processes (MDPs), their partially observable extensions
(POMDPs), and the POMCP algorithm invented to solve them approximately, but efficiently.
These concern single agents. We then discuss IPOMDPs and the application of POMCP to
solving them when there are multiple agents. Finally, we describe the multi-round trust task.


3.1 Partially Observable Markov Decision Processes 



Figure 1: A Markov decision process. The agent starts at state $s_0$ and has two possible actions $a_1$ and
$a_2$. Exercising either, it can transition into three possible states, one of which ($s_2$) can be reached through
either action. Each state and action combination holds a particular reward expectation $R(a, s)$. Based on
this information, the agent can choose an action and transitions with probability $\mathcal{T}(s, a, s_0)$ to a new state
$s$, obtaining an actual reward $r$ in the process. The procedure is then repeated from the new state, with its
given action possibilities, or else the decision process might end, depending on the given process.


A Markov decision process (MDP) [25] is defined by sets $\mathcal{S}$ of "states" and $\mathcal{A}$ of "actions", and
several components that evaluate and link the two, including transition probabilities $\mathcal{T}$, and
information $\mathcal{R}$ about possible rewards. States describe the position of the agent in the
environment, and determine which actions can be taken, accounting for, at least probabilistically,
the consequences for rewards and future states. Transitions between states are described by
means of a collection of transition probabilities $\mathcal{T}$, assigning to each possible state $s \in \mathcal{S}$ and
each possible action $a \in \mathcal{A}$ from that state, a transition probability distribution or measure
$\mathcal{T}(\tilde{s}, a, s) := P[\tilde{s} | s, a]$ which encodes the likelihood of ending in state $\tilde{s}$ after taking
action $a$ from state $s$. The Markov property requires that the transition (and reward) probabilities
only depend on the current state (and action), and are independent of past events. An
illustration of these concepts can be found in Figure 1.

Figure 2: A partially observable Markov decision process. Starting from an observed interaction history $h$,
the agents use their belief state $\mathcal{B}(h)$ to determine how likely they find themselves in one of three possible
actual states $s_1$, $s_2$, $s_3$. Any solution to the POMDP requires calculating all future action values depending
on all possible future histories and the respective belief states. Following this, an observation $o$ is obtained
by the agent and the new history $\{h, a, o\}$ becomes the starting point for the next decision.


By contrast, in a partially observable MDP (i.e., a POMDP [26]), the agent can also be uncertain
about its state $s$. Instead, there is a set of observations $o \in \mathcal{O}$ that incompletely pin down
states, depending on the observation probabilities $\mathcal{W}(o, a, s) := P[o | s, a]$. These report
the probability of observing $o$ when action $a$ has occasioned a transition to state $s$. See Figure 2
for an illustration of the concept.

We use the notation $s_t = s$, $a_t = a$ or $o_t = o$ to refer explicitly to the outcome state, action
or observation at a given time. The history $h \in \mathcal{H}$ is the sequence of actions and observations,
wherein each action from the point of view of the agent moves the time index ahead by 1,
$h_t := \{o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t\}$. Here $o_0$ may be trivial (deterministic or empty). The agent
can perform Bayesian inference to turn its history at time $t$ into a distribution $P[S_t = s_t | h_t]$ over
its state at time $t$, where $S_t$ denotes the random variable encoding the uncertainty about the
current state at time $t$. This distribution is called its belief state $\mathcal{B}(h_t)$, with
$\mathcal{B}(h_t) := P[S_t = s_t | h_t]$. Inference depends on knowing $\mathcal{T}$, $\mathcal{W}$ and the distribution over the initial state
$S_0$, which we write as $\mathcal{B}(h_0)$. Information about rewards $\mathcal{R}$ comprises a collection of utility
functions $r \in \mathcal{R}$, $r : \mathcal{A} \times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$, a discount function $\Gamma \in \mathcal{R}$, $\Gamma : \mathbb{N} \to [0,1]$ and a
survival function $H \in \mathcal{R}$, $H : \mathbb{N} \times \mathbb{N} \to [0,1]$ (see footnote 1). The utility functions determine the immediate
gain associated with executing action $a$ at state $s$ and observing $o$ (sometimes writing $r_t$ for the
reward following the $t$-th action). From the utilities, we define the reward function $R : \mathcal{A} \times \mathcal{S} \to \mathbb{R}$
as the expected gain for taking action $a$ at state $s$, $R(a, s) := E[r(a, s, o)]$, where this expectation
is taken over all possible observations $o$. Since we usually operate on histories, rather than fixed
states, we define the expected reward from a given history $h$ as $R(a, h) := \sum_{s \in \mathcal{S}} R(a, s) P[s | h]$.

Footnote 1: A more general definition would be $H \in \mathcal{R}$, $H : \mathcal{H} \times \mathcal{H} \to [0,1]$, allowing it to be conditional on the precise present
and future histories.












The discount function weights the present impact of a future return, depending only on the
separation between present and future. We use exponential discounting with a fixed number
$\gamma \in [0,1]$ to define our discount function:

$$\Gamma(\tau - t) = \gamma^{\tau - t}, \quad \forall t, \tau \in \mathbb{N},\ \tau > t. \tag{3.1}$$

Additionally, we define $H$ such that $H(\tau, t)$ is 0 for $\tau > K$ and 1 otherwise. $K$ in general is
a random stopping time. We call the second component $t$ the reference time of the survival
function.

The survival function allows us to encode the planning horizon of an agent during decision
making: If $H(\tau, t)$ is 0 for $\tau - t > P$, we say that the local planning horizon at $t$ is less than or
equal to $P$.
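
As a minimal sketch of how the discount and survival functions jointly weight a sequence of rewards, the following assumes a fixed, deterministic planning horizon `horizon` in place of the random stopping time $K$; the function names and the list-of-rewards representation are illustrative assumptions, not part of the original formulation.

```python
def discount(tau, t, gamma):
    """Exponential discount Gamma(tau - t) = gamma ** (tau - t), cf. equation (3.1)."""
    return gamma ** (tau - t)


def survival(tau, t, horizon):
    """Survival function H(tau, t) under the assumption of a fixed planning
    horizon: 1 while tau - t <= horizon, 0 afterwards."""
    return 1.0 if tau - t <= horizon else 0.0


def weighted_return(rewards, t, gamma, horizon):
    """Sum over tau >= t of gamma^(tau - t) * H(tau, t) * rewards[tau].

    `rewards[tau]` stands in for the (expected) reward at time tau."""
    return sum(
        discount(tau, t, gamma) * survival(tau, t, horizon) * rewards[tau]
        for tau in range(t, len(rewards))
    )
```

For example, with `rewards = [1, 1, 1, 1]`, `gamma = 0.5` and `horizon = 1`, only the rewards at $\tau = t$ and $\tau = t + 1$ contribute, so rewards beyond the planning horizon are ignored rather than merely down-weighted.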

The policy $\pi \in \Pi$, $\pi(a, h) := P[a | h]$ is defined as a mapping of histories to probabilities over
possible actions. Here $\Pi$ is called the set of admissible policies. For convenience, we sometimes
write the distribution function as $\pi(h)$. The value function of a fixed policy $\pi$ starting from
present history $h_t$ is

$$V^{\pi}(h_t) := \sum_{\tau = t}^{\infty} \gamma^{\tau - t} H(\tau, t)\, E[r_\tau | \pi, h_\tau] \tag{3.2}$$

i.e., a sum of the discounted future expected rewards (note that $h_\tau$ is a random variable here,
not a fixed value). Equally, the state-action value is

$$Q^{\pi}(a, h_t) := R(a, h_t) + \sum_{\tau = t+1}^{\infty} \gamma^{\tau - t} H(\tau, t)\, E[r_\tau | \pi, h_\tau]. \tag{3.3}$$

Definition 1 (Formal Definition - POMDP). Using the notation of this section, a POMDP is defined
as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{W}, \mathcal{R}, \Pi, \mathcal{B}_0)$ of components as outlined above.

Convention 1 (Softmax Decision Making). A wealth of experimental work (for instance [27, 28, 29])
has found that the choices of humans (and other animals) can be well described by softmax policies based
on the agents' state-action values, to encompass the stochasticity of observed behaviour in real subject
data. See [30] for a behavioural economics perspective. In view of
using our model primarily for experimental analysis, we will base our discussion on the decision making
rule:

$$\pi(a, h) = P[a | h] = \frac{e^{\beta Q^{\pi}(a, h)}}{\sum_{b \in \mathcal{A}} e^{\beta Q^{\pi}(b, h)}} \tag{3.4}$$

where $\beta > 0$ is called the inverse temperature parameter and controls how diffuse the probabilities are.
The policy

$$\pi(a, h) = \begin{cases} 1 & \text{if } Q^{\pi}(a, h) = \max\{Q^{\pi}(b, h) \mid b \in \mathcal{A}\} \text{ (assuming this is unique)} \\ 0 & \text{otherwise} \end{cases} \tag{3.5}$$

can be obtained as a limiting case for $\beta \to \infty$.

Convention 2. From now on, we shall denote by $Q(a, h)$ the state-action value $Q^{\pi}(a, h)$ with respect
to the softmax policy.
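
The softmax decision rule can be sketched in a few lines. This is an illustrative implementation under stated assumptions: action values are passed in as a plain dictionary, and the maximum is subtracted before exponentiating purely for numerical stability (this does not change the resulting probabilities). The function name is hypothetical.

```python
import math


def softmax_policy(q_values, beta):
    """Return P[a | h] proportional to exp(beta * Q(a, h)).

    q_values: dict mapping each action to its state-action value.
    beta: inverse temperature (beta >= 0); larger values give a more
          deterministic policy, beta = 0 gives a uniform policy.
    """
    top = max(q_values.values())
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    weights = {a: math.exp(beta * (q - top)) for a, q in q_values.items()}
    normaliser = sum(weights.values())
    return {a: w / normaliser for a, w in weights.items()}
```

With `beta = 0` every action is equally likely, while for large `beta` nearly all probability mass falls on the action with the highest value, matching the limiting greedy case.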

3.2 POMCP


POMCP was introduced by [31] as an efficient approximation scheme for solving POMDPs.
Here, for completeness, we describe the algorithm; later, we adapt it to the case of an IPOMDP.

POMCP is a generative model-based sampling method for calculating history-action values.
That is, it builds a limited portion of the tree of future histories starting from the current $h_t$,
using a sample-based search algorithm (called upper confidence bounds for trees (UCT); [32])
which provides guarantees as to how far from optimal the resulting action can be, given a
certain number of samples (based on results in [33] and [34]). Algorithm 1 provides pseudo
code for the adapted POMCP algorithm. The procedure is presented schematically in Figure 3.


Figure 3: Image of POMCP (panels: 1) SEARCH: sample from beliefs; 2) SIMULATE: UCT; 3) ROLLOUT: value
estimate at a leaf; 4) Update & Repeat: new leaf nodes added). The algorithm samples a state $s$ from the
belief state $\mathcal{B}(h)$ at the root R (R representing the current history $h$), keeps this state $s$ fixed till step 4),
follows UCT in already visited domains (labelled tree nodes T) and performs a rollout and back update
when hitting a leaf (labelled L). Then steps 1) - 4) are repeated until the specified number of simulations
has been reached.


The algorithm is based on a tree structure $T$, wherein nodes $T(h) = (N(h), Q(h), \mathcal{B}(h))$ represent
possible future histories explored by the algorithm, and are characterized by the number $N(h)$
of times history $h$ was visited in the simulation, the estimated value $Q(h)$ for visiting $h$ and
the approximate belief state $\mathcal{B}(h)$ at $h$. Each new node in $T$ is initialized with initial action
exploration counts $N_{init}(h, a) = 0$ for all possible actions $a$ from $h$, an initial action value
estimate $Q_{init}(a, h) = 0$ for all possible actions $a$ from $h$, and an empty belief state $\mathcal{B}(h) = \emptyset$.

The value $N(h)$ is then calculated from all action counts from the node, $N(h) = \sum_{a} N(h, a)$.

$Q(h)$ denotes the mean of obtained values, for simulations starting from node $h$. $\mathcal{B}(h)$ can
either be calculated analytically, if it is computationally feasible to apply Bayes' theorem, or be
approximated by the so-called root sampling procedure (see below).

In terms of the algorithm, the generative model $\mathcal{G}(s, a)$ of the POMDP determines $(s', o, r) \sim \mathcal{G}(s, a)$,
the simulated subsequent state, observation and reward for taking $a$ at $s$; $s$ itself is
sampled from the current history $h$. Then, every (future) history of actions and observations $h$
defines a node $T(h)$ in the tree structure $T$, which is characterized by the available actions and
their average simulated action values $Q(a, h)$ under the policy SoftUCT at future states.

Algorithm 1 Partially Observable Monte Carlo Planning

procedure SEARCH(h_t, t, n)
    for SIMULATIONS = 1, ..., n do
        k ← t
        if h_t = o_0 then
            s ∼ B_0
        else
            s ∼ B(h_t)
        end if
        SIMULATE(s, h_t, t, k)
    end for
    return a ∼ SoftUCT
end procedure

procedure ROLLOUT(s, h, t, k)
    if H(k, t) ≤ 0 then
        return 0
    end if
    a ∼ π_rollout(h, ·)
    (s', o, r) ∼ G(s, a)
    h ← {h, a, o}
    k ← k + 1
    return r + γ ROLLOUT(s', h, t, k)
end procedure

procedure SIMULATE(s, h, t, k)
    if H(k, t) ≤ 0 then
        return 0
    end if
    if h ∉ T then
        for all a ∈ A do
            T(ha) ← (N_init(h, a), Q_init(a, h), ∅)
        end for
        return ROLLOUT(s, h, t, k)
    end if
    a ∼ SoftUCT
    (s', o, r) ∼ G(s, a)
    h' ← {h, a, o}
    k ← k + 1
    R ← r + γ SIMULATE(s', h', t, k)
    N(h) ← N(h) + 1
    N(h, a) ← N(h, a) + 1
    Q(a, h) ← Q(a, h) + (R − Q(a, h)) / N(h, a)
    return R
end procedure
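
To make the control flow of the search concrete, here is a compact Python sketch on a deliberately tiny, hypothetical problem: a hidden partner type ("kind" or "selfish") that never changes, two actions, and a hand-written generative model. Everything about this toy problem, the fixed planning depth used in place of the survival function, the uniform rollout policy, the "try untried actions first" rule and all parameter defaults are assumptions made for illustration; it is not the implementation used for the trust task.

```python
import math
import random

ACTIONS = ("trust", "keep")


def generative_model(state, action, rng):
    """Toy G(s, a): returns (next_state, observation, reward)."""
    if action == "keep":
        return state, "none", 1.0
    p_return = 0.9 if state == "kind" else 0.1
    if rng.random() < p_return:
        return state, "return", 2.0
    return state, "none", 0.0


class Node:
    def __init__(self):
        self.n = {a: 0 for a in ACTIONS}    # visit counts N(h, a)
        self.q = {a: 0.0 for a in ACTIONS}  # value estimates Q(a, h)


def soft_uct(node, c, beta, rng):
    """Softmax over UCT-augmented values; untried actions are taken first."""
    untried = [a for a in ACTIONS if node.n[a] == 0]
    if untried:
        return rng.choice(untried)
    total = sum(node.n.values())
    score = {a: node.q[a] + c * math.sqrt(math.log(total) / node.n[a])
             for a in ACTIONS}
    top = max(score.values())
    weights = [math.exp(beta * (score[a] - top)) for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights)[0]


def rollout(state, depth, horizon, gamma, rng):
    """Crude value estimate at a leaf using a uniform random policy."""
    if depth >= horizon:
        return 0.0
    action = rng.choice(ACTIONS)
    state, _, reward = generative_model(state, action, rng)
    return reward + gamma * rollout(state, depth + 1, horizon, gamma, rng)


def simulate(state, history, depth, tree, horizon, gamma, c, beta, rng):
    if depth >= horizon:
        return 0.0
    if history not in tree:  # new leaf: add it and estimate by rollout
        tree[history] = Node()
        return rollout(state, depth, horizon, gamma, rng)
    node = tree[history]
    action = soft_uct(node, c, beta, rng)
    next_state, obs, reward = generative_model(state, action, rng)
    ret = reward + gamma * simulate(next_state, history + ((action, obs),),
                                    depth + 1, tree, horizon, gamma, c, beta, rng)
    node.n[action] += 1
    node.q[action] += (ret - node.q[action]) / node.n[action]  # running mean
    return ret


def search(belief_particles, n_sims, horizon=3, gamma=0.9, c=2.0, beta=5.0, seed=0):
    """Run n_sims simulations from the root and return the root node."""
    rng = random.Random(seed)
    tree = {(): Node()}
    for _ in range(n_sims):
        state = rng.choice(belief_particles)  # sample s from the root belief
        simulate(state, (), 0, tree, horizon, gamma, c, beta, rng)
    return tree[()]
```

Calling `search(["kind"] * 10, 2000)` returns the root node, whose `q` dictionary holds the estimated action values; a softmax over these values would then give the action distribution at the root.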


If the node has been visited for the $N(h)$-th time, with action $a$ being taken for the $N(h, a)$-th time,
then the average simulated value is updated (starting from 0) using sampled simulated rewards
$R$ up to terminal time $K$, when the current simulation/tree traversal ends, as:

$$Q(a, h) = Q(a, h) + \frac{R - Q(a, h)}{N(h, a)}. \tag{3.6}$$

The search algorithm has two decision rules, depending on whether a traversed node has al¬ 
ready been visited or is a leaf of the search tree. In the former case, a decision is reached using 
SoftUCT by defining 

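$$\text{SoftUCT:} \quad \tilde{Q}(a, h) := Q(a, h) + c \sqrt{\frac{\log N(h)}{N(h, a)}}, \qquad P[a | h] = \frac{e^{\beta \tilde{Q}(a, h)}}{\sum_{b \in \mathcal{A}} e^{\beta \tilde{Q}(b, h)}} \tag{3.7}$$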

where c is a parameter that favors exploration (analogous to an equivalent parameter in UCT). 
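
A minimal sketch of such a decision rule is given below: an exploration bonus of the UCT type is added to each action value, and the action probabilities are then obtained by a softmax over the augmented values. The handling of untried actions (probability mass split evenly among them) is an assumption made here to avoid division by zero; the function and argument names are illustrative.

```python
import math


def soft_uct_probs(q, n, c, beta):
    """Action probabilities from a softmax over exploration-augmented values.

    q: dict action -> current value estimate Q(a, h)
    n: dict action -> visit count N(h, a)
    c: exploration parameter; beta: inverse temperature.
    """
    untried = [a for a in q if n[a] == 0]
    if untried:
        # Assumption: untried actions are explored first, uniformly.
        return {a: (1.0 / len(untried) if n[a] == 0 else 0.0) for a in q}
    total = sum(n.values())  # N(h) as the sum of the action counts
    score = {a: q[a] + c * math.sqrt(math.log(total) / n[a]) for a in q}
    top = max(score.values())
    weights = {a: math.exp(beta * (score[a] - top)) for a in q}
    normaliser = sum(weights.values())
    return {a: weights[a] / normaliser for a in q}
```

With equal value estimates, the less frequently visited action receives the larger bonus and hence the higher selection probability, which is the intended exploratory effect of $c$.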

If the node is new, a so-called "rollout" policy is used to provide a crude estimate of the value
of the leaf. This policy can be either very simple (uniform or $\epsilon$-greedy based on a very simple
model) or specifically adjusted to the search space, in order to optimize performance.

The rollout value estimate together with the SoftUCT exploration rule is the core mechanism
for efficient tree exploration. In this work, we only use an $\epsilon$-greedy mechanism, as is described
in the section on the multi-round trust game.

Another innovation in POMCP that underlies its dramatically superior performance is called
root sampling. This procedure allows the belief state to be formed at later states, as long as the initial
belief state $\mathcal{B}_0$ is known. This means that, although it is necessary to perform inference to draw
samples from the belief state at the root of the search tree, one can then use each sample as if it
was (temporarily) true, without performing inference at states that are deeper in the search tree
to work out the new transition probabilities that pertain to the new belief states associated with
the histories at those points. The reason for this is that the probabilities of getting to the nodes
in the search tree represent exactly what is necessary to compensate for the apparent inferential
infelicity [31], i.e., the search tree performs as a probabilistic filter. The technical details of the
root sampling procedure can be found in [31].

In the presence of analytically tractable updating rules (or at least analytically tractable
approximations) the belief state at a new node can instead be calculated by Bayes' theorem. In the case
of the multi-round trust game below, we follow the approximating updating rule in [20].


3.3 Interactive Partially Observable Markov Decision Processes 

An Interactive Partially Observable Markov Decision Process (IPOMDP) is a multi agent setting 
in which the actions of each agent may observably affect the distribution of expected rewards 
for the other agents. 

Since IPOMDPs may be less familiar than POMDPs, we provide more detail about them; consult
[23] for a complete reference formulation and [35] for an excellent discussion and extension.

We define the IPOMDP such that the decision making process of each agent becomes a stan¬ 
dard (albeit large) POMDP, allowing the direct application of POMDP methods to IPOMDP 
problems. 


Figure 4: Interactive Partially Observable Markov Decision Process (diagram labels: State Variable,
Environment, Partner Model, Solution, Full IPOMDP from Partners' View). Compared to a POMDP, the process
is further complicated by the necessity to keep different models $\theta$ of the other agents' intentions, so that
evidence on the correct intentional model may be accrued in the belief state $\mathcal{B}(h)$. Any solution to the
IPOMDP requires calculating all future action values depending on all possible future histories and the
respective belief states.

Definition 2 (Formal Definition - IPOMDP). An IPOMDP is a collection of POMDPs such that the
following holds:

Agents are indexed by the finite set $\mathcal{I}$. Each agent $i \in \mathcal{I}$ is described by a single POMDP
$(\mathcal{S}^i, \mathcal{A}^i, \mathcal{O}^i, \mathcal{T}^i, \mathcal{W}^i, \mathcal{R}^i, \Pi^i, \mathcal{B}^i_0)$ denoting its actual decision making process. We first define the physical state space
$\mathcal{S}^i_{phys}$: an element $s^i \in \mathcal{S}^i_{phys}$ is a complete setting of all features of the environment that determine the
action possibilities and obtainable rewards $\mathcal{R}^i$ of $i$ for the present and all possible following histories,
from the point of view of $i$. The physical state space is augmented by the set $\mathcal{D}^i$ of models of the
partner agents $\theta^{ij} \in \mathcal{D}^i$, $j \in \mathcal{I} \setminus \{i\}$, called intentional models, which are themselves POMDPs
$\theta^{ij} = (\mathcal{S}^{ij}, \mathcal{A}^{ij}, \mathcal{O}^{ij}, \mathcal{T}^{ij}, \mathcal{W}^{ij}, \mathcal{R}^{ij}, \Pi^{ij}, \mathcal{B}^{ij}_0)$. These describe how agent $i$ believes agent $j$ perceives the world and
reaches its decisions. The possible state space of agent $i$ can be written $\mathcal{S}^i = \mathcal{S}^i_{phys} \times \mathcal{D}^i$ and a given state
can be written $\mathfrak{s}^i = (s^i, \times_j \theta^{ij})$, where $s^i \in \mathcal{S}^i_{phys}$ is the physical state of the environment and $\theta^{ij}$ are
the models of the other agents. Note that the intentional models $\theta^{ij}$ themselves contain state spaces that
encode the history of the game as observed by agent $j$ from the point of view of agent $i$. The elements of
$\mathcal{S}^i$ are called interactive states. Agents themselves act according to the softmax function of history-action
values, and assume that their interactive partner agents do the same. The elements of the definition are
summarized in Figure 4.

Convention 3. We denote by capital $S$ and capital $\mathfrak{S}$ the random variables that encode uncertainty
about the physical state and the interactive state respectively.

When choosing the set of intentional models, we consider agents and their partners to engage in
a cognitive hierarchy of successive mentalization steps [10, 9], depicted in Figure 5. The simplest
agent can try to infer what kind of partner it faces (level 0 thinking). The next simplest agent
could additionally try to infer what the partner might be thinking of it (level 1). Next, the agent
might try to understand their partner's inferences about the agent's thinking about the partner
(level 2). Generally, this would enable a potentially unbounded chain of mentalization steps. It
is a tenet of cognitive hierarchy theory [10] that the hierarchy terminates finitely and for many
tasks after only very few steps (e.g., Poisson, with a mean of around 1.5).
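
To give a feel for what a Poisson distribution with mean 1.5 implies about thinking steps, the following sketch computes the probability of each number of steps, truncated at a maximum and renormalised. The truncation, the function name and the choice to index steps from 0 are assumptions for illustration; how these steps map onto the level numbering used in this paper is not specified by the sketch.

```python
import math


def poisson_step_distribution(mean, max_steps):
    """Truncated, renormalised Poisson probabilities for 0..max_steps thinking steps."""
    pmf = [math.exp(-mean) * mean ** k / math.factorial(k)
           for k in range(max_steps + 1)]
    total = sum(pmf)
    return [p / total for p in pmf]
```

With `mean = 1.5` and `max_steps = 4`, a single step is the most likely outcome and more than 80% of the probability mass lies on at most two steps, which illustrates why only a few levels of the hierarchy need to be considered in practice.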


Figure 5: Computational Theory of Mind (ToM) formalizes the notion of our understanding of other
people's thought processes. (Panels: Level -1: we act on an impersonal environment. Level 0: what type is
our partner? Level 1: what type do they think we are? Level 2: what type do they think, we think, they are?)


We formalize this notion as follows. 

Definition 3 (A Hierarchy Of Intentional Models). Since models of the partner agent may contain
interactive states in which it in turn models the agent $i$, we can specify a hierarchical intentional structure
$\mathcal{D}^{i,l}$, built from what we call the level $l \geq -1$ intentional models. $\mathcal{D}^{i,l}$ is defined inductively from

$$\theta^{ij,-1} \in \mathcal{D}^{i,-1} \Leftrightarrow \mathcal{S}^{ij,-1} = \mathcal{S}^{ij}_{phys} \times \{0\}.$$

This means that any level $-1$ intentional model reacts strictly to the environment, without holding any
further intentional models. The higher levels are obtained as

$$\theta^{ij,l} \in \mathcal{D}^{i,l} \Leftrightarrow \mathcal{S}^{ij,l} = \mathcal{S}^{ij}_{phys} \times \mathcal{D}^{ij,l-1}.$$

Here $\mathcal{D}^{ij,l-1}$ denotes the level $l - 1$ intentional models that agent $i$ thinks agent $j$ might hold of the other
players. These level $l - 1$ intentional models arise by the same procedure applied to the level $-1$ models
that agent $i$ thinks agent $j$ might hold.

Definition 4 (Theory of Mind (ToM) Level). We follow a similar assumption to so-called k-level
thinking, in that we assume that each agent operates at a particular level $l_i$ (called the agent's
theory of mind (ToM) level, which it is assumed to know), and models all partners as being at level
$l_j = l_i - 1$.

We chose Definition 4 for comparability with earlier work [24, 20].

Convention 4. It is necessary to be able to calculate the belief state in every POMDP that is encountered.
An agent updates its belief state in a Bayesian manner, following an action $a^i_t$ and an observation $o^i_{t+1}$.
This leads to a sequential update rule operating over the belief state $P[\mathfrak{S}^i_t | h^i_t]$ (a distribution over interactive states) of a given agent $i$ at a given
time $t$:

$$P[\mathfrak{S}^i_{t+1} = \tilde{\mathfrak{s}}^i \,|\, \{h^i_t, a^i_t, o^i_{t+1}\}] = \eta\, \mathcal{W}(o^i_{t+1}, a^i_t, \tilde{\mathfrak{s}}^i) \sum_{\mathfrak{s}^i \in \mathcal{S}^i} \mathcal{T}(\tilde{\mathfrak{s}}^i, a^i_t, \mathfrak{s}^i)\, P[\mathfrak{S}^i_t = \mathfrak{s}^i \,|\, h^i_t]. \tag{3.8}$$

Here $\eta$ is a normalization constant associated with the joint distribution of transition and observation
probability, conditional on $\tilde{\mathfrak{s}}^i$, $\mathfrak{s}^i$, and $a^i_t$. The observation $o^i_{t+1}$ in particular incorporates any results
of the actions of the other agents, before the next action of the given agent.

We note that the above rule applies recursively to every intentional model in the nested structure $\mathcal{D}^i$, as
every POMDP has a separate belief state.

This is slightly different from [23] so that the above update is conventional for a POMDP.
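
The sequential belief update can be sketched for a finite state set as follows. The dictionary representation of the belief and the two callables `transition` and `observe` (standing in for the transition and observation probabilities) are assumptions made for illustration; in the nested setting the same routine would have to be applied to each intentional model's own belief state.

```python
def belief_update(belief, action, observation, transition, observe):
    """One step of Bayesian belief updating over a finite set of states.

    belief: dict state -> P[S_t = state | h_t]
    transition(s_next, a, s): P[s_next | s, a]
    observe(o, a, s_next): P[o | s_next, a]
    Returns the posterior dict over states given (h_t, a_t, o_{t+1}).
    """
    states = list(belief)
    unnormalised = {
        s_next: observe(observation, action, s_next)
        * sum(transition(s_next, action, s) * belief[s] for s in states)
        for s_next in states
    }
    evidence = sum(unnormalised.values())
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    # Dividing by the evidence plays the role of the normalization constant.
    return {s: p / evidence for s, p in unnormalised.items()}
```

As a usage example, with two static partner types held at equal prior probability, and an observation that is four times as likely under the first type as under the second, the posterior probability of the first type after one update is 0.8.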

Convention 5 (Expected Utility Maximisation). The decision making rule in our IPOMDP treatment
is based on expected utility as encoded in the reward function. The explicit formula for the action
value $Q(a^i_t, h^i_t)$ under a softmax policy (equation (3.4)) is:

$$Q(a^i_t, h^i_t) = R(a^i_t, h^i_t) + \sum_{o^i_{t+1} \in \mathcal{O}^i} P[o^i_{t+1} \,|\, \{h^i_t, a^i_t\}] \sum_{b \in \mathcal{A}^i} \pi(b, h^i_{t+1})\, \gamma^i\, Q(b, h^i_{t+1} | t). \tag{3.9}$$

Here $h^i_{t+1} = \{h^i_t, a^i_t, o^i_{t+1}\}$, $\pi$ is the softmax policy of equation (3.4), and $Q(b, h^i_{t+1} | t)$ denotes the action value at $t + 1$ with the survival function
conditioned to reference time $t$. $\gamma^i$ is the discount factor of agent $i$, rather than the $i$-th power. This
defines a recursive Bellman equation, with the value of taking action $a^i_t$ given history $h^i_t$ being the expected
immediate reward $R(a^i_t, h^i_t)$ plus the expected value of future actions conditional on $a^i_t$ and its possible
consequences $o^i_{t+1}$, discounted by $\gamma^i$.

The belief state $\mathcal{B}(h^i_t)$ allows us to link $h^i_t$ to a distribution of interactive states and use $\mathcal{W}$ to calculate
$P[o^i_{t+1} \,|\, \{h^i_t, a^i_t\}]$, in particular including the reactions of other agents to the actions of one agent. We call
the resulting policy the "solution" to the IPOMDP.


3.4 Equilibria and IPOMDPs 

Our central interest is in the use of the IPOMDP to capture the interaction amongst human
agents with limited cognitive resources and time for their exchanges. It has been noted in [10]
that the distribution of subject levels favours rather low values (e.g., Poisson, with a mean of
around 1.5). In the opposite limit, sufficient conditions are known under which taking the cognitive
hierarchy out to infinity for all involved agents allows for at least one Bayes-Nash equilibrium
solution (part II, theorem II, p. 322 of Harsanyi [8]), and sufficient conditions have been shown
in [36] under which a solution to the infinite hierarchy model can be approximated by the
sequence of finite hierarchy model solutions. A different condition has also been discussed in the
literature; however, this condition assumes an infinite time horizon in the interaction. In general,
it is not true that the infinite hierarchy solution will be a Nash equilibrium.
For the purposes of computational psychiatry we find the very mismatches and limitations that
prevent subjects' strategies from evolving to a (Bayes-)Nash equilibrium in the given time frame to
be of particular interest. Therefore we restrict our attention to quantal response equilibrium-like
behaviours ([30]), based on potentially inconsistent initial beliefs held by the involved agents with
ultimately very limited cognitive resources and finite time exchanges.


3.5 Applying POMCP to an IPOMDP 

An IPOMDP is a collection of POMDPs, so POMCP is, in principle, applicable to each
encountered POMDP.

However, unlike the examples in [31], an IPOMDP contains the intentional model POMDPs $\theta^{ij}$
as part of the state space, and these themselves contain a rich structure of beliefs. So, the state
sampled from the belief state at the root for agent $i$ is an $|\mathcal{I}|$-tuple $(s^i, \theta^{i1}, \ldots, \theta^{i(|\mathcal{I}|-1)})$ of a
physical state $s^i$ and $(|\mathcal{I}| - 1)$ POMDPs, one for each partner. (This is also akin to the random
instantiation of players in [8].) Since the $\theta^{ij}$ still contain belief states in their own right, it is still
necessary to do some explicit inference during the creation of each tree. Indeed, explicit
inference is hard to avoid altogether during simulation, as the interactive states require the partner
to be able to learn [23]. Nevertheless, a number of performance improvements that we detail
below still allow us to apply the POMCP method involving substantial planning horizons.


3.6 Simplifications for dyadic repeated exchange 

Many social paradigms based upon game theory, including the iterated ultimatum game,
prisoners' dilemma, iterated "rock, paper, scissors" (for 2 agents) and the multi-round trust game,
involve repeated dyads. In these, each interaction involves the same structure of physical states
and actions $(\mathcal{S}_{phys}, \mathcal{A})$ (see below), and all discount functions are 0 past a finite horizon.

Definition 5 (Dyadic Repeated Exchange without state uncertainty). Consider a two agent IPOMDP 
framework in which there is no physical state uncertainty: both agents fully observe each other's actions
and there is no uncertainty about environmental influence; and in which agents vary their play only 
based on intentional models and an agent does not believe that the partner can be made to transition 
between different intentional models by the agent's actions. Additionally, the framework is assumed to 
reset after each exchange (i.e., after both agents have acted once). 

Formally this means: There is a fixed setting $(\mathcal{S}_{phys}, \mathcal{A}, \mathcal{T})$, such that physical states, actions from these
states, transitions in the physical state and hence also obtainable rewards, differ only by a changing time 
index and there is no observational uncertainty and an agent does not believe that the partner can be 
made to transition between different intentional models by the agent's actions. Then after each exchange 
the framework is assumed to reset to the same distribution of physical initial states S within this setting 
(i.e. the game begins anew). 


Games of this sort admit an immediate simplification:

Theorem 1 (Level 0 Recombining Tree). In the situation of Definition 5, level 0 action values at any
given time only depend on the total set of actions and observations so far and not the order in which those
exchanges were observed.


Proof. The level $-1$ partner model only acts on the physical state it encounters, and the physical
state space variable $S$ is reset at the beginning of each round in the situation of Definition 5. Therefore,
given a state $s$ in the current round and an action $a$ by a level 0 agent, the likelihood of each
transition to some state $s_1$, $\mathcal{T}(s_1, a, s)$, and of making observation $o$, $\mathcal{W}(o, a, s_1)$, is the same at
every round from the point of view of the level 0 agent. It follows that the cumulative belief
update from equation (3.8), from the initial beliefs $\mathcal{B}_0$ to the current beliefs, will not depend on
the order in which the action-observation pairs $(a, o)$ were observed. □


This means that, depending on the size of the state space and the depth of planning of interest, we may analytically calculate level 0 action values even online, or use precalculated values for larger problems. Furthermore, because their action values will only depend on past exchanges and not on the order in which they were observed, their decision making tree can be reformulated as a recombining tree.
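The order independence behind the recombining tree can be illustrated with a small sketch. It assumes a generic multiplicative Bayes update with a stationary likelihood table; the table values, the exchange labels and the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations

# Hypothetical stationary likelihoods P(exchange | partner type) for three
# partner types; in the setting above these do not change between rounds.
LIKELIHOOD = {
    ("low", "low"): (0.6, 0.3, 0.1),
    ("high", "low"): (0.3, 0.3, 0.2),
    ("high", "high"): (0.1, 0.4, 0.7),
}


def belief_after(history, prior=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior over partner types after a sequence of (action, observation) pairs."""
    weights = list(prior)
    for exchange in history:
        weights = [w * p for w, p in zip(weights, LIKELIHOOD[exchange])]
    total = sum(weights)
    return [w / total for w in weights]


def recombining_key(history):
    """Order-free summary of a history: which exchanges occurred and how often."""
    return tuple(sorted(Counter(history).items()))


history = [("high", "high"), ("low", "low"), ("high", "high")]
reference = belief_after(history)
for reordered in permutations(history):
    posterior = belief_after(list(reordered))
    # Every ordering of the same exchanges yields the same belief and the same key.
    assert all(abs(a - b) < 1e-12 for a, b in zip(posterior, reference))
    assert recombining_key(reordered) == recombining_key(history)
```

Because the key is identical for every ordering, action values computed for one ordering could be cached under that key and reused for all others.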

Sometimes, an additional simplification can be made:

Theorem 2 (Trivialised Planning). In the situation of the definition above, if the two agents do not act simultaneously and the state transition of the second agent is entirely dependent on the action executed by the first agent (as in the multi round trust task), and additionally the intentional model of the partner cannot be changed through the actions of the second agent, then a level 0 second agent can gain no advantage from planning ahead, since their actions will not change the action choices of the first agent.

Proof. In the scenario described in the theorem, the physical state variable $S$ of agent 2 is entirely dependent on the action of the other agent. If the agent is level 0, they model their partner as level −1, and by the additional assumption the second agent does not believe that the partner can be made to transition between different intentional models by the second agent's actions. Hence the partner will not change their distribution of state transitions depending on the agent's actions, and hence the agent's distribution of future obtainable rewards will not change either. □

Theorem 3 (Trivialised Theory of Mind Levels). In the situation of the preceding theorem, for the first to go agent only the even theory of mind levels $k \in \{0\} \cup 2\mathbb{N}$ show distinct behaviours, while the odd levels $k \in 2\mathbb{N} - 1$ behave like one level below, meaning $k - 1$. Equivalently, for the second to go partner only the odd levels $k \in \{0\} \cup (2\mathbb{N} - 1)$ show distinct behaviours.

Proof. In the scenario described in the preceding theorem, the second to go level 0 agent behaves like a level −1 agent, as it does not benefit from modeling the partner. This implies that the first to go agent gains no additional information at level 1 thinking, since the partner behaves like level −1, which was already modeled by the level 0 first to go agent. In turn, the level 2 second to go agent gains no additional information over the level 1 second to go agent, as their partner model does not change between modeling the partner at level 0 or level −1. By induction, we get the result. □
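The collapse of theory of mind levels stated in theorem 3 can be written down as a small mapping. This is only an illustrative sketch; the function name and the boolean flag are hypothetical.

```python
def effective_level(k, moves_first):
    """Lowest theory of mind level that behaves like nominal level k (k >= 0)."""
    if k < 0:
        raise ValueError("nominal level must be non-negative")
    if moves_first:
        # First to go agent: odd levels behave like the level below.
        return k - 1 if k % 2 == 1 else k
    # Second to go agent: even levels above 0 behave like the level below.
    return k - 1 if (k > 0 and k % 2 == 0) else k


first_mover_levels = sorted({effective_level(k, True) for k in range(6)})
second_mover_levels = sorted({effective_level(k, False) for k in range(6)})
```

For nominal levels 0 to 5 this leaves levels 0, 2 and 4 for the first to go agent and levels 0, 1, 3 and 5 for the second to go agent.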

Examples of the additional simplifications in the two preceding theorems can be seen in the ultimatum game and the multi round trust game.


3.7 The Trust Task 


[Figure 6 panel text] Multi Round Trust Game: for 10 consecutive rounds with the same partner (Investor, Trustee). At every step:

• 21 possible investor actions
• 3 × (amount of investment) + 1 possible conditional trustee responses
• 631 possible distinct exchanges per turn
• Hence the size of the whole decision making tree initially is 631^10

Figure 6: Physical features of the multi round trust game. 


The multi-round trust task, illustrated in figure 6, is a paradigm social exchange game. It involves two people, one playing the role of an 'investor', the other that of a 'trustee', over 10 sequential rounds, expressed by a time index $t = 1, 2, \ldots, 10$. Both agents know all the rules of the game. In each round, the investor receives an initial endowment of 20 monetary units. The investor can send any part of this amount to the trustee. The experimenter trebles this quantity, and then the trustee decides how much to send back to the investor, between 0 points and the whole amount that she receives. The repayment by the trustee is not increased by the experimenter. After the trustee's action, the investor is informed, and the next round starts.

We consider the trust task as an IPOMDP with two agents, i.e., $\mathcal{I} = \{I, T\}$ contains just $I$ for the investor and $T$ for the trustee. We consider the state to contain two components: one physical and observable (the endowment and investments), the other non-physical and non-observable (in our case, parameters of the utility function). It is the latter that leads to the partial observability in the IPOMDP. Following [24], we reduce complexity by quantizing the actions and the (non-observable) states of both investor and trustee (illustrated for one complete round in the figures below). The actions are quantized into 5 fractional categories, shown in figure 7. For the investor, we consider $a^I \in \{0, 0.25, 0.5, 0.75, 1\}$ (corresponding to an investment of $20 \times a^I$ monetary units, and encompassing even investment ranges). For the trustee, we consider $a^T \in \{0, 0.167, 0.333, 0.5, 0.67\}$ (corresponding to a return of $3 \times 20 \times a^I \times a^T$ monetary units, and encompassing even return ranges). Note that the trustee's action is degenerate if the investor gives 0. The pure monetary payoffs for both agents in each round are

$$\text{investor: } \chi^I(a^I, a^T) = 20 - 20 a^I + 3 \times 20\, a^I a^T,$$
$$\text{trustee: } \chi^T(a^I, a^T) = 3 \times 20\, a^I - 3 \times 20\, a^I a^T.$$

The payoffs of all possible combinations for both partners are depicted in figure 8. In IPOMDP terms, the investor's physical state is static, whereas the trustee's state space is conditional on the previous action of the investor. The investor's possible observations are the trustee's responses, with a likelihood that depends entirely on the investor's intentional model of the trustee. The trustee observes the investor's action, which also determines the trustee's new physical state, as shown in figure 9.
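The monetary payoffs above can be tabulated directly for the quantized actions. The following is a minimal sketch; the constant and function names are hypothetical, while the endowment of 20, the trebling factor and the action fractions are those given in the text.

```python
ENDOWMENT = 20
MULTIPLIER = 3
INVESTOR_FRACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)
TRUSTEE_FRACTIONS = (0.0, 0.167, 0.333, 0.5, 0.67)


def investor_payoff(a_i, a_t):
    """Investor keeps what was not invested and receives the trustee's repayment."""
    return ENDOWMENT - ENDOWMENT * a_i + MULTIPLIER * ENDOWMENT * a_i * a_t


def trustee_payoff(a_i, a_t):
    """Trustee receives the trebled investment and keeps what is not repaid."""
    return MULTIPLIER * ENDOWMENT * a_i - MULTIPLIER * ENDOWMENT * a_i * a_t


# Payoff table for one exchange, keyed by (investor fraction, trustee fraction).
payoff_table = {
    (a_i, a_t): (investor_payoff(a_i, a_t), trustee_payoff(a_i, a_t))
    for a_i in INVESTOR_FRACTIONS
    for a_t in TRUSTEE_FRACTIONS
}
```

In every cell the two payoffs sum to $20 + 40 a^I$, since trebling adds twice the invested amount to the total available.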






Figure 7: Investor: (left) The 21 possible actions are summarized into 5 possible investment categories. Trustee: (right) Returns are classified into 5 possible categories, conditionally on investor action. Impossible returns are marked in black.

Figure 8: Payoffs in the multi round trust task. (left) Investor payoffs for a single exchange. (right) Trustee payoffs for a single exchange.



Figure 9: (Physical) Transitions and Observations: (left) Physical state transitions and observations of the investor. The trustee's actions are summarized, as they cannot change the following physical state transition. (right) Physical state transitions and observations of the trustee. The trustee's actions are summarized, as they cannot change the following physical state transition.


3.7.1 Inequality Aversion - Compulsion to Fairness 

The aspects of the states of investor and trustee that induce partial observability are assumed to arise from differential levels of cooperation.

Figure 10: Immediate Fehr-Schmidt utilities for a single exchange. Left column shows investor preferences: (top left) Completely unguilty investor values only the immediate payoff, (middle left) Guilt 0.4 investor is less likely to keep everything to themselves (bottom left corner option), (bottom left) Guilt 1 investor will never keep everything to themselves (bottom left option). Right column shows trustee preferences: (top right) unguilty trustee would like to keep everything to themselves, (middle right) Guilt 0.4 trustee is more likely to return at least a fraction of the gains, (bottom right) Guilt 1 trustee will always strive to return the fair split.

One convenient (though not unique) way to characterize this is via the Fehr-Schmidt inequality aversion utility function (figure 10). This allows us to account for the observation that many trustees return an even split even on the last exchange of the 10 rounds, even though no further gain is possible. We make no claim that this is the only explanation for such behaviour, but it is a tractable and well-established mechanism that has been used successfully in other tasks (ISlIIllill ). For the investor, this suggests that:

$$r^I(a^I, a^T, \alpha^I) = \chi^I(a^I, a^T) - \alpha^I \max\{\chi^I(a^I, a^T) - \chi^T(a^I, a^T), 0\}. \qquad (3.10)$$

Here, $\alpha^I$ is called the "guilt" parameter of the investor and quantifies their aversion to unequal outcomes in their favor. We quantize guilt into 3 concrete guilt types $\{0, 0.4, 1\} = \{\alpha_1, \alpha_2, \alpha_3\}$. Similarly, the trustee's utility is

$$r^T(a^I, a^T, \alpha^T) = \chi^T(a^I, a^T) - \alpha^T \max\{\chi^T(a^I, a^T) - \chi^I(a^I, a^T), 0\}, \qquad (3.11)$$

with the same possible guilt types. We choose these particular values, as guilt values above 0.5 tend to produce behaviours similar to $\alpha = 1$ and values below 0.3 tend to behave very similarly to $\alpha = 0$. Thus we take $\alpha_1$ to represent guilt values in $[0, 0.3]$, $\alpha_2$ to represent guilt values in $(0.3, 0.5)$ and $\alpha_3$ to represent guilt values in $[0.5, 1]$. We assume that neither agent's actual guilt type changes during the 10 exchanges.
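The guilt-weighted utilities of equations (3.10) and (3.11) can be sketched as follows. The function names are hypothetical, and the guilt parameter is assumed to multiply the advantageous inequality term, as in the standard Fehr-Schmidt form.

```python
GUILT_TYPES = (0.0, 0.4, 1.0)


def payoffs(a_i, a_t):
    """Monetary payoffs (investor, trustee) for one exchange."""
    chi_i = 20 - 20 * a_i + 3 * 20 * a_i * a_t
    chi_t = 3 * 20 * a_i - 3 * 20 * a_i * a_t
    return chi_i, chi_t


def investor_utility(a_i, a_t, guilt):
    """Investor payoff minus guilt times any inequality in the investor's favour."""
    chi_i, chi_t = payoffs(a_i, a_t)
    return chi_i - guilt * max(chi_i - chi_t, 0.0)


def trustee_utility(a_i, a_t, guilt):
    """Trustee payoff minus guilt times any inequality in the trustee's favour."""
    chi_i, chi_t = payoffs(a_i, a_t)
    return chi_t - guilt * max(chi_t - chi_i, 0.0)
```

For instance, a guilt 1 investor who keeps the whole endowment (`investor_utility(0.0, 0.0, 1.0)`) obtains zero utility, which matches the observation in figure 10 that such an investor will never keep everything to themselves.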


3.7.2 Planning Behaviour 

The survival functions of investor and trustee are used to delimit the planning horizon. The agents are required not to plan beyond the end of the game at time 10, and within that constraint they are supposed to plan $P$ steps ahead into the interaction. This results in the following form for the survival functions (regardless of whether for investor or trustee):

$$H_P(\tau, t) = \begin{cases} 1, & (\tau - t) \le P \wedge (\tau + t) \le 10, \\ 0, & (\tau - t) > P \vee (\tau + t) > 10. \end{cases} \qquad (3.12)$$

The value $P$ is called the planning horizon. We consider $P \in \{0, 2, 7\}$ for immediate, medium and long planning types. We chose these values as $P = 7$ covers the range of behaviours from $P = 4$ to $P = 9$, while $P = 2$ yields compatibility with earlier works ([24, 20]) and allows for short planning but high level agents, covering the range of behaviours for planning $P = 1$ to $P = 3$. We confirm later that the behaviour of $P = 7$ and $P = 9$ agents is almost identical, and the former saves memory and processing time. Agents are assumed to attribute to their opponents the same degree of planning as they have themselves. The discounting factors $\gamma^I$ and $\gamma^T$ are set to 1 in our setting.


3.8 Belief State 

Since all agents use their own planning horizon in modeling the partner and level $k$ agents model their partner at level $k - 1$, inference in intentional models in this analysis is restricted to the guilt parameter $\alpha$. Using a categorical distribution on the guilt parameter and a Dirichlet prior on the probabilities of the categorical distribution, we get a Dirichlet-Multinomial distribution for the probabilities of an agent having a given guilt type at some point during the exchange. Hence $B_0$ is a Dirichlet-Multinomial distribution,

$$B_0 \sim \mathrm{DirMult}(a_0), \quad a_0 = (1, 1, 1),$$

with the initial belief state

$$P[\alpha^{partner} = \alpha_i] = \frac{1}{3}.$$




Keeping consistent with the model in [20], our approximation of the posterior distribution is also a Dirichlet-Multinomial distribution, with the parameters of the Dirichlet prior being updated to

$$a_i^{t+1} = a_i^t + P[o_{t+1} = \text{observed action} \mid \alpha^{partner} = \alpha_i], \qquad (3.13)$$

writing $\alpha^{partner} = \alpha_i$ for the intentional models.
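The additive update in equation (3.13) can be sketched in a few lines. The function names and the example likelihood values are hypothetical; reading the belief off as the normalized Dirichlet parameters is the standard predictive probability of a Dirichlet-Multinomial distribution.

```python
def update_dirichlet(params, action_likelihoods):
    """Add to each parameter the likelihood of the observed action under that guilt type."""
    return [a + p for a, p in zip(params, action_likelihoods)]


def belief(params):
    """Probability of each guilt type implied by the Dirichlet parameters."""
    total = sum(params)
    return [a / total for a in params]


params = [1.0, 1.0, 1.0]          # uniform initial parameters a_0 = (1, 1, 1)
observed = [0.2, 0.3, 0.5]        # hypothetical P[observed action | guilt type i]
params = update_dirichlet(params, observed)
posterior = belief(params)
```

Because each update only adds a likelihood, beliefs change gradually: a single surprising action shifts the posterior only slightly once several exchanges have been observed.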


3.8.1 Theory of Mind Levels and Agent Characterization 


Since the physical state transition of the trustee is fully dependent on the investor's action, and one agent's guilt type cannot be changed by the actions of the other agent, theorem 2 implies that the level 0 trustee is trivial, gaining nothing from planning ahead. Conversely, the level 0 investor can use a recombining tree as in theorem 1. Therefore, the chain of cognitive hierarchy steps for the investor is $k \in \{0\} \cup \{2n \mid n \in \mathbb{N}\}$, and for the trustee it is $k \in \{0\} \cup \{2n - 1 \mid n \in \mathbb{N}\}$. Trustee planning is trivial until the trustee reaches at least theory of mind level 1. With the inverse temperature $\beta$ in (3.4) determined empirically from real subject data [20] for suitably noisy behaviour, our subjects are then characterized via the triplet $(k, \alpha, P)$ of theory of mind level $k$, guilt parameter $\alpha \in \{0, 0.4, 1\}$ and planning horizon $P \in \{0, 2, 7\}$.


3.9 Level −1 and POMCP rollout mechanism

The level −1 models are obtained by having the level −1 agent always assume all partner types to be equally likely ($P[\alpha^{partner} = \alpha_i] = \frac{1}{3}, \forall i$), setting the planning horizon to 0, meaning the partner acts on immediate utilities only, and calculating the agent's expected utilities after marginalizing over partner types and their respective response probabilities based on their immediate utilities.

In the POMCP treatment of the multi round trust game, if a simulated agent reaches a given history for the first time, a value estimate for the new node is derived by treating the agent as level −1 and using an ε-greedy decision making mechanism on the expected utilities to determine their actions until the present planning horizon.
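The level −1 computation and the ε-greedy rollout choice can be sketched for the investor as follows. This is a minimal illustration under stated assumptions: the trustee's response probabilities are taken to be a softmax over immediate utilities, the inverse temperature `beta` is an arbitrary placeholder, and all function names are hypothetical.

```python
import math
import random

INVESTOR_ACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)
TRUSTEE_ACTIONS = (0.0, 0.167, 0.333, 0.5, 0.67)
GUILT_TYPES = (0.0, 0.4, 1.0)


def payoffs(a_i, a_t):
    """Monetary payoffs (investor, trustee) for one exchange."""
    return 20 - 20 * a_i + 60 * a_i * a_t, 60 * a_i - 60 * a_i * a_t


def utility(own, other, guilt):
    """Guilt-weighted utility: own payoff minus guilt times advantageous inequality."""
    return own - guilt * max(own - other, 0.0)


def softmax(values, beta):
    """Softmax probabilities; the maximum is subtracted for numerical stability."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def level_minus1_investor_values(guilt, beta=1.0):
    """Expected immediate utility of each investor action, averaging over trustee types."""
    values = []
    for a_i in INVESTOR_ACTIONS:
        expected = 0.0
        for trustee_guilt in GUILT_TYPES:  # all partner types equally likely
            trustee_utils = []
            for a_t in TRUSTEE_ACTIONS:
                inv, tru = payoffs(a_i, a_t)
                trustee_utils.append(utility(tru, inv, trustee_guilt))
            for a_t, prob in zip(TRUSTEE_ACTIONS, softmax(trustee_utils, beta)):
                inv, tru = payoffs(a_i, a_t)
                expected += prob * utility(inv, tru, guilt) / len(GUILT_TYPES)
        values.append(expected)
    return values


def epsilon_greedy(values, epsilon, rng):
    """Index of a random action with probability epsilon, else of the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)


rollout_values = level_minus1_investor_values(guilt=0.4)
rollout_choice = epsilon_greedy(rollout_values, epsilon=0.1, rng=random.Random(0))
```

In a rollout, such a choice would be repeated for both roles at each remaining step up to the planning horizon, and the accumulated utilities would serve as the value estimate of the newly added node.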


4 Results 


We adapted the POMCP algorithm IISTI to solve IPOMDPs [23], and cast the multi-round trust task as an IPOMDP that could thus be solved. We made a number of approximations that were prefigured in past work in this domain [24, 20], and also made various observations that dramatically simplified the task of planning, without altering the formal solutions. This allowed us to look at longer planning horizons, which is important for the full power of the intentional modeling to become clear.


Here, we first seek to use this new and more powerful planning method to understand the classes of behaviour that arise from different settings of the parameters in section 4.3. From the study of human interactions [14], the importance of coaxing (returning more than the fair split) has been established. From our own study of the data collected so far, we define four coarse types of 'pure' interactions, which we call "Cooperation", "Coaxing to Cooperation", "Coaxing to Exploitation" and "Greedy"; we conceptualize how these might arise. We also delimit the potential consequences of having overly restricted the planning horizon in past work in this domain, and examine the qualitative interactive signatures (such as how quickly average investments and repayments rise or fall) that might best capture the characteristics of human subjects playing the game.


We then continue to discuss the quality of statistical inference, by carrying out model inversion for our new method in section 4.7 and comparing to earlier work in this domain [24].


Finally, we treat real subject data collected for an earlier study ([20]) in a later section, and show that our new approach recovers significant behavioural differences not obtained by earlier models and offers a significant improvement in the classification of subject behaviour, through the inclusion of the planning parameter in the estimation and the quality of estimation on the trustee side.


The materials used in this section, as well as the code used to generate them, can be found on Andreas Hula's github repository. All material was generated on the local WTCN cluster. We used R [38] and Matlab [39] for data analysis and the boost C++ libraries [40] for code generation.


4.1 Modalities 

All simulations were run on the local cluster at the Wellcome Trust Centre for Neuroimaging. 
For sample paths and posterior distributions, for each pairing of investor guilt, investor sophis¬ 
tication and trustee guilt and trustee sophistication, 60 full games of 10 exchanges each were 
simulated, totaling 8100 games. Additionally, in order to validate the estimation, a uniform mix 
of all parameters was used, implying a total of 2025 full games. 

To reduce the variance of the estimation, we employed a pre-search method. Agents with ToM 
greater than 0 first explored the constant strategies (offering/returning a fixed fraction) to obtain 
a minimal set of Q values from which to start searching for the optimal policy using SoftUCT. 
This ensures that inference will not "get stuck" in a close-to-optimal initial offer just because 
another initial offer was not adequately explored. This is more specific than just increasing 
the exploration bonus in the SoftUCT rule, which would diffuse the search during all stages, 
rather than helping search from a stable initial grid. 

We set a number n of simulations for the initial step, where the beliefs about the partner are still 
uniform and the time horizon is still furthest away. We then reduce the number of simulations 
as the time horizon approaches {n,n^,n^,... ,nr^) . 


4.2 Simulation And Statistical Inference 

Unless stated otherwise, we employ an inverse temperature in the softmax of $\beta = 5$ (noting the substantial scale of the rewards). The exploration constant for POMCP was set to $c = 25$. The initial beliefs were uniform, $a_i = 1, \forall i$, for each subject. For the 3 possible guilt types we use the following expressions in the text: $\alpha = \alpha_1$ is "greedy", $\alpha = \alpha_2$ is "pragmatic" and $\alpha = \alpha_3$ is "guilty". However, on all the graphs, we give the exact model classification in the form I: $(k^I, \alpha^I, P^I)$ for the investor and T: $(k^T, \alpha^T, P^T)$ for the trustee.


We present average results over multiple runs generated stochastically from each setting of the parameter values. In the figures, we report the actual characteristics of investor and trustee; however, in keeping with the overall model, although each agent knows their own parameters, they are each inferring their opponent's degree of guilt based on their initial priors.


As a consequence of the observation in section 3.8.1, we only consider $k \in \{0, 2\}$ for the investor and $k \in \{0, 1\}$ for the trustee. Planning horizons are restricted to $P \in \{0, 2, 7\}$, as noted in section 3.7, with the level 0 trustee always having a planning horizon of 0.


Actions for both agents are parametrized as in section 3.7 and averaged across identical parameter pairings. In the graphs, we show actions in terms of the percentages of the available points that are offered or returned. For the investor, the numerical amounts can be read directly from the graphs; for the trustee, these amounts depend on the investor's action.


Dual to generating behaviour from the model is inverting it to find parameter settings that best explain observed interactions [24, 20]. Conceptually, this can be done by simulating exchanges between partners of given parameter settings $(k, \alpha, P)$, taking the observed history of investments and responses, and using a maximum likelihood estimation procedure which finds the settings for both agents that maximise the chance that simulated exchanges between agents possessing those values would match the actual, observed exchange. We calculate the action likelihoods through the POMCP method outlined in the earlier section 3.2 and accumulate the negative log likelihoods, looking for the combination that produces the smallest negative log likelihood. This is carried out for each combination of guilt and sophistication for both investor and trustee.
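The inversion procedure amounts to a grid search over parameter settings for both agents. The sketch below shows the bookkeeping only; `toy_likelihood` is a hypothetical stand-in for the POMCP-derived action likelihoods, and all names are assumptions made for illustration.

```python
import itertools
import math

GUILT = (0.0, 0.4, 1.0)
HORIZONS = (0, 2, 7)


def settings(tom_levels, level0_plans=True):
    """All (k, guilt, P) triplets; optionally pin level 0 agents to P = 0."""
    out = []
    for k, guilt, horizon in itertools.product(tom_levels, GUILT, HORIZONS):
        if k == 0 and not level0_plans and horizon != 0:
            continue
        out.append((k, guilt, horizon))
    return out


INVESTOR_SETTINGS = settings((0, 2))
TRUSTEE_SETTINGS = settings((0, 1), level0_plans=False)


def negative_log_likelihood(history, investor, trustee, action_likelihood):
    """Accumulate -log P(observed action) over both roles and all exchanges."""
    nll = 0.0
    for t, exchange in enumerate(history):
        past = history[:t]
        for role in ("investor", "trustee"):
            p = action_likelihood(role, investor, trustee, past, exchange)
            nll -= math.log(max(p, 1e-300))  # guard against log(0)
    return nll


def fit(history, action_likelihood):
    """Return the (investor, trustee) settings with the smallest negative log likelihood."""
    pairs = itertools.product(INVESTOR_SETTINGS, TRUSTEE_SETTINGS)
    return min(
        pairs,
        key=lambda pair: negative_log_likelihood(history, pair[0], pair[1], action_likelihood),
    )


def toy_likelihood(role, investor, trustee, past, exchange):
    """Hypothetical likelihood: favours action fractions close to the agent's guilt."""
    guilt = investor[1] if role == "investor" else trustee[1]
    action = exchange[0] if role == "investor" else exchange[1]
    return math.exp(-abs(action - guilt)) / 3.0


best_investor, best_trustee = fit([(1.0, 0.5)] * 10, toy_likelihood)
```

With a real likelihood function, each evaluation would replay the observed history through the planning algorithm, so the cost of the search is dominated by the number of candidate pairs times the cost of planning.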


4.3 Paradigmatic Behaviours 


Figure 11 (with the additional outcome comparison in figure 12), and figures 13 and 14 show the three characteristic types of behaviour, in each case for two sets of parameters for investor and trustee. The upper graphs show the average histories of actions of the investor (blue) and trustee (red) across the 10 rounds; the middle graphs show the mean posterior distributions over the three guilt parameters (0, 0.4, 1) as estimated by the investor, and the lower graphs show the mean posterior distributions as estimated by the trustee, at four stages in the game (rounds 0, 3, 6 and 9). These show how well the agents of each type are making inferences about their partners.


Figure 11 shows evidence for strong cooperation between two agents who are characterized by high inequity aversion (i.e., guilty). Cooperation develops more slowly for agents with shorter (left) than longer (right) planning horizons, enabling a reliable distinction between different guilty pairs. This is shown more explicitly in figure 12 in terms of the total amount of money made by both participants. Both cases can be seen as cases of a tit for tat like approach by the players, although unlike with a strict tit for tat mechanism, the process leading to high level cooperation is generally robust against subsequent below par actions by either player. Rather, high level players would employ coaxing to reinforce cooperation in this case. This is true even for lower level players: after they have formed beliefs about the partner, they will not immediately reduce their offers upon a few low offers or returns, due to the Bayesian updating mechanism.

The posterior beliefs show both partners ultimately inferring the other's guilt type correctly in both pairings. However, the $P^I = 7$ investors remain aware of the possibility that the partners may actually be pragmatic, and therefore the high level, long horizon investors are prone to reduce their offers preemptively towards the end of the game. This data feature was noted in particular in the study [20], and our generative model provides an explanation for it, based on the posterior beliefs of higher level agents explained above.

Figure 13 shows that level 1 trustees employ coaxing (returning more than the fair split) to get the investor to give higher amounts over extended periods of time. In the example settings, the level 0 investor completely falls for the trustee's initial coaxing (left), coming to believe that the trustee is guilty rather than pragmatic until close to the very end. However, the level 2 investor (right) remains cautious and starts reducing offers soon after the trustee gets greedy, decreasing their offers faster than if playing a truly guilty type. The level 2 investor on average remains undecided between the partner being guilty or pragmatic. Either inference prevents them from being as badly exploited as the level 0 investor.

In these plots, investor and trustee both have long planning horizons; we later show what happens when a trustee with a shorter horizon ($P^T = 2$) attempts to deceive.

Figure 11: Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor $(k^I, \alpha^I, P^I) = (2, 1, 2)$; Trustee (1, 1, 2); right plots: Investor (2, 1, 7) and Trustee (1, 1, 7). The posterior distributions are shown for $\alpha = (0, 0.4, 1)$ at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value.

[Figure 12 legend: dark blue, I (ToM 2, G: 1, P: 2) paired with T (ToM 1, G: 1, P: 2); light blue, I (ToM 2, G: 1, P: 7) paired with T (ToM 1, G: 1, P: 7).]

Figure 12: Average overall gains for the exchanges in figure 11 with planning 2 (dark blue) and 7 (light blue). The difference is highly significant (p < 0.01) at a sample size of 60 for both parameter settings. Error bars are standard deviations.


A level 1 trustee can also get pragmatic investors to cooperate through coaxing, as demonstrated in figure 14. The returns are a lot higher than for a level 0 guilty trustee, who lacks a model of their influence on the investor, and hence does not return enough to drive up cooperation. This initial coaxing is a very common behaviour of high level healthy trustees, trying to get the investor to cooperate more quickly, for both guilty and pragmatic high level trustees.

Figure 13: Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor $(k^I, \alpha^I, P^I) = (0, 1, 7)$; Trustee (1, 0.4, 7); right plots: Investor (2, 1, 7) and Trustee (1, 0.4, 7). The posterior distributions are shown for $\alpha = (0, 0.4, 1)$ at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value.

Figure 14: Average Exchanges (upper) and posteriors (mid and lower), Investor (0, 0.4, 7) and Trustee (1, 1, 7). The posterior distributions are shown for $\alpha = (0, 0.4, 1)$ at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value.

4.4 Inconsistency or Impulsivity

Figure 15: Average Exchanges (upper) and posteriors (mid and lower). Investor (0, 1, 2) and Trustee (1, 0.4, 2). The posterior distributions are shown for $\alpha = (0, 0.4, 1)$ at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value.


Trustees with planning horizon 2 tend to find it difficult to maintain deceptive strategies. As can be seen in figure 15, even when both agents have a planning horizon of 2, a short sighted trustee builds significantly less trust than a long sighted one. This is because it fails to see sufficiently far into the future, and exploits too early. This planning horizon thus captures cognitive limitations or impulsive behaviour, while the planning horizon of 7 generally describes the consistent execution of a strategy during play. Such a distinction may be very valuable for the study of clinical populations suffering from psychiatric disorders such as attention deficit hyperactivity disorder (ADHD) or borderline personality disorder (BPD), who might show high level behaviours, but then fail to maintain them over the course of the entire game. Inferring this requires the ability to capture long horizons, something that had eluded previous methods. This type of behaviour shows how important the availability of different planning horizons is for modeling, as earlier implementations such as [24] would treat this impulsive type as the default setting.


4.5 Greedy Behaviour

Figure 16: Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor $(k^I, \alpha^I, P^I) = (0, 0, 7)$; Trustee (1, 0, 7); right plots: Investor (2, 0, 7) and Trustee (1, 0, 7). The posterior distributions are shown for $\alpha = (0, 0.4, 1)$ at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value.


Another behavioural phenotype with potential clinical significance arises with fully greedy partners; see figure 16. Greedy low level investors invest very little, even if trustees try to convince them of a high guilt type on their part as described above (coaxing). Cooperation repeatedly breaks down, which is reflected in the high variability of the investor trajectory. Two high level greedy types initially cooperate, but since the greedy trustee egregiously over-exploits, cooperation usually breaks down quickly over the course of the game, and is not repaired before the end. In the present context, the greedy type appears quite pathological in that they seem to hardly care at all about their partner's type. The main exception to this is the level 2 greedy investor (an observation that underscores how theory of mind level and planning can change behaviour that would seem at first to be hard coded in the inequality aversion utility function). The level 0 greedy investor will cause cooperation to break down regardless of their beliefs: as in figure 16, the posterior beliefs of the level 0 investor show that they believe the trustee to be guilty, but they do not alter their behaviour in the light of this inference.




4.6 Planning Mismatch - High Level Deceived By Lower Level

Figure 17: Average Exchanges, Investor (2, 1, 2) and Trustee (1, 0.4, 7). Error bars are standard deviations. The asterisk denotes the true partner guilt value.


In figure 17, the investor is level 2, and so should have the wherewithal to understand the level 1 trustee's deception. However, the trustee's longer planning horizon permits her to play more consistently, and thus exploit the investor for almost the entire game. This shows that the advantage of sophisticated thinking about other agents can be squandered given insufficient planning, and poses an important question about the efficient deployment of cognitive resources to the different demands of modeling and planning of social interactions.

4.7 Confusion


4.7.1 Model Inversion 


A minimal requirement for using the proposed model to fit experimental data is self-consistency. That is, it should be possible to recover the parameters from behaviour that was actually generated from the model itself. This can alternatively be seen as a test of the statistical power of the experiment, i.e., whether 10 rounds suffice in order to infer subject parameters. Figure 18 shows the confusion matrix, which indicates the probabilities of the inferred guilt (top), ToM (middle) and planning horizon (bottom) for investor (left) and trustee (right), in each case marginalizing over all the other factors. We discuss a particular special case of the obtained confusion in figure 19. Said confusion relates to observations made in empirical studies (see [19, 20]) and suggests the notion of the planning parameter as a measure of consistency of play. Later, we show comparative data reported in the study [24], which only utilized a fixed planning horizon of 2 and 2 guilt states (and did not exploit the other simplifications that we introduced above); see figure 20 for a depiction of the levels of confusion in that study. These simplifications implied that the earlier study would find recovery of theory of mind in particular to be harder.

Figure 18: Percentage of inferred guilt, theory of mind and planning horizon for investor (left) and trustee (right) as a function of the true values, marginalizing out all the other parameters. Each plot corresponds to a uniform mix of 15 pairs per parameter combination and partner parameter combination.
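A marginal confusion matrix of the kind shown in figure 18 can be computed from pairs of true and inferred parameter values. The following sketch uses hypothetical names and toy data; it is not the analysis code used for the figure.

```python
from collections import Counter


def confusion_matrix(true_values, inferred_values, categories):
    """Row-normalized matrix: entry [i][j] is P(inferred = categories[j] | true = categories[i])."""
    counts = Counter(zip(true_values, inferred_values))
    matrix = []
    for true in categories:
        row_total = sum(counts[(true, inferred)] for inferred in categories)
        matrix.append(
            [counts[(true, inferred)] / row_total if row_total else 0.0 for inferred in categories]
        )
    return matrix


# Toy example for the planning horizon: one P = 7 agent is misclassified as P = 2.
true_horizons = [2, 2, 7, 7]
inferred_horizons = [2, 2, 2, 7]
matrix = confusion_matrix(true_horizons, inferred_horizons, categories=[0, 2, 7])
```

Marginalizing over the other factors corresponds to pooling all games that share the true value of the parameter of interest, whatever the remaining parameters of either agent were.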


Guilt is recovered in a highly reliable manner. By contrast, there is a slight tendency to overestimate ToM in the trustees. The greatest confusion turns out to be inferring a $P^I = 7$ investor as having $P^I = 2$ when playing an impulsive trustee ($P^T = 2$), a problem shown more directly in figure 19.

Figure 19: Maximum likelihood estimation result for $P^I = 7$ and $P^T = 2$ agent combinations: marginalized maximum likelihood estimation of investor planning horizon over all other parameters.

The issue is that when the trustee is impulsive, far-sighted investors ($P^I = 7$) can gain no advantage over near-sighted ones ($P^I = 2$), and so the choices of this dyad lead to mis-estimation. Alternatively put, an impulsive trustee brings the investor down to his or her level. This has been noted in previous empirical studies, notably the observations in [19, 20] of the effect on investors of playing erratic trustees. The same does not apply on the trustee side, since the reactive nature of the trustee's tactics makes them far less sensitive to impulsive investor play.

Given the huge computational demands of planning, it seems likely that investors could react to observing a highly impulsive trustee by reducing their own actual planning horizons. Thus, the inferential conclusion shown in figure 19 may in fact not be erroneous. However, this possibility reminds us of the necessity of being cautious in making such inferences in a two-player compared to a one-player setting.


4.7.2 Confusion Comparison to earlier Work 

We compare our confusion analysis to the one carried out in the grid-based calculation in [24].
In [24] the authors do not report exact confusion metrics for the guilt state, only noting that it
is possible to reliably recover whether a subject is characterized by high guilt (0.7) or low guilt
(0.3). We can, however, compare to the reported ToM level recovery. The comparison with [24]
faces an additional difficulty: although that study used the same formal framework as the present
work, the indistinguishability of the level 1 and 2 trustees and of the level 0 and 1 investors had
not yet been identified. This explains the somewhat higher amount of confusion when classifying
ToM levels reported in [24]. Also, since the Dirichlet-Multinomial probability was calculated
numerically in that study, some between-level differences will derive only from changes
in quadrature points for higher levels. As can be seen in figure 20 (left), almost all of the level
1 trustees at low guilt are misclassified. This is due to them being classified as level 2 instead:
both levels have the same behavioral features, but apparently the numerical calculation
of the belief state favored the level 2 classification over the level 1 classification. The tendency
to overestimation holds on the investor side as well, with considerable confusion
between level 0 and level 1 investors, who should be behaviorally equivalent. In sum, this leads
to the reported overestimation of the theory of mind level. We have depicted the confusion
levels reported in [24] in figure 20.
Figure 20: Classification probability reported in [24]. In analogy to figure 18 we depict the generated vs
estimated values in a matrix scheme.


4.8 Computational Issues 

The viability of our method rests on the running time and stability of the obtained behaviours.
In figure 21 we show these for the case of the first action, as a function of the number of
simulation paths used. All these calculations were run on the local Wellcome Trust Center for
Neuroimaging (WTCN) cluster. Local processor cores were of Intel Xeon E312xx (Sandy Bridge)
type clocked at 2.2 GHz, and no process used more than 4 GB of RAM. Note that, unless more
than 25k paths are used, calculations take less than 2 minutes.
Figure 21: (left) Average running times for calculating the first action value of a level 2, guilt 1 investor
from a given number of simulations, as a function of planning horizon (complexity). (right) Discrepancy
to the converged case of the action probabilities for the first action, measured in squared discrepancies.


We quantify simulation stability by comparing simulations for a level 2 investor (a reasonable
upper bound, because the action value calculation for this incorporates the level 1 trustee
responses) based on varying numbers of paths with a simulation involving $10^6$ paths that has
converged. We calculate the between (simulated) subject discrepancies C of the probabilities
for the first action for P ∈ {2,3,4,5,6,7,8,9}:

$$C_{i,j} = \frac{1}{120} \sum_{k=1}^{120} \left( P_k[a_0 = i] - P_{10^6}[a_0 = i] \right) \left( P_k[a_0 = j] - P_{10^6}[a_0 = j] \right), \qquad i,j \in \{0, \ldots, 4\}$$

where $P_{10^6}[a_0 = i]$ are the converged probabilities, and $P_k[a_0]$ is the action likelihood of
simulated subject k. If the sum of squares of the entries in the discrepancy matrix is low, then the
probabilities will be close to their converged values.
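
The discrepancy computation can be sketched in a few lines. This is a minimal illustration, not
the code used for the reported results: the function names and the toy probabilities are
assumptions, and the sketch assumes that the discrepancy matrix is the average, over simulated
subjects, of the outer product of each subject's deviation from the converged first-action
probabilities.

```python
from typing import List, Sequence


def discrepancy_matrix(subject_probs: Sequence[Sequence[float]],
                       converged: Sequence[float]) -> List[List[float]]:
    """Average outer product of deviations from the converged probabilities.

    subject_probs[k][i] is the first-action probability of action i for
    simulated subject k; converged[i] is the reference probability.
    """
    n_actions = len(converged)
    n_subjects = len(subject_probs)
    matrix = [[0.0] * n_actions for _ in range(n_actions)]
    for probs in subject_probs:
        deviation = [p - c for p, c in zip(probs, converged)]
        for i in range(n_actions):
            for j in range(n_actions):
                matrix[i][j] += deviation[i] * deviation[j] / n_subjects
    return matrix


def sum_of_squares(matrix: Sequence[Sequence[float]]) -> float:
    """Scalar summary: sum of the squared matrix entries."""
    return sum(entry * entry for row in matrix for entry in row)


# Toy example with 5 actions and 2 simulated subjects (made-up numbers).
converged = [0.1, 0.2, 0.4, 0.2, 0.1]
subjects = [
    [0.1, 0.2, 0.4, 0.2, 0.1],  # identical to the converged case
    [0.2, 0.2, 0.3, 0.2, 0.1],  # moves 0.1 of mass from action 2 to action 0
]
C = discrepancy_matrix(subjects, converged)
print(sum_of_squares(C))
```

In this toy case only the entries involving actions 0 and 2 are non-zero, so the scalar summary
stays small; larger or more widespread deviations would increase it.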

As can be seen from figure 21 (right), for 25k paths even agents planning 9 steps ahead have
converged in their initial action probabilities, such that their action probabilities vary from the
converged value by no more than about 0.1. However, note that this convergence is not always
monotonic in either the planning horizon or the number of sample paths. The former is
influenced by the differing complexity of preferences for different horizons: sometimes, actions are
harder to resolve for short than long horizons. The latter is influenced by the initial pre-search
using constant strategies.

Although 25k paths suffice for convergence even when planning 9 steps ahead, this horizon
remains computationally challenging. We thus considered whether it is possible to use a shorter
horizon of 7 steps, without materially changing the preferred choices. Figure 22 illustrates that
the difference is negligible compared with the fluctuations of the Monte Carlo approach, even
for the worst case involving the pairing of 2 pragmatic types, with high ToM levels and long
planning horizons. At the same time, the calculation for P = 7 is twice as fast as for P = 9 for the
level 2 investor, which even just for the first action is a difference of 100 seconds.



Figure 22: Average Exchanges, Investor (2, 0.4, 7) (dark blue) and Trustee (1, 0.4, 7) (red), as well as
Investor (2, 0.4, 9) (light blue) and Trustee (1, 0.4, 9) (rose). The difference between the 2 planning horizons
is not significant at any point. Error bars are standard deviations.


Finally we compare our algorithm at planning 2 steps ahead to the grid-based calculation used
before [24], [20]. The speed advantage is a factor of 200 for $10^4$ paths in POMCP, demonstrating
the considerable improvement that enables us to consider longer planning horizons.


4.9 Comparison To Earlier Subject Classifications

We will show below, using real subject data taken from [20], that our reduction to 3 guilt states
does not render likelihoods worse and only serves to improve classification quality. We
compared the results of our new method with the results obtained in earlier studies ([24], [20]).


4.9.1 Dataset 

We performed inference on the same data sets as in Xiang et al. [20] (which were partially
analysed in [24], [14] and UHl). This involved 195 dyads playing the trust game over 10
exchanges. The investor agent was always a healthy subject; the trustees comprised various
clinical groups, including anonymous, healthy trustees (the "impersonal" group; 48 subjects),
healthy trustees who were briefly encountered before the experiment (the "personal" group; 52
subjects), trustees diagnosed with Borderline Personality Disorder (BPD) (the "BPD" group; 55
subjects), and anonymous healthy trustees matched in socio-economic status (SES) to the (lower
than healthy) SES distribution of BPD trustees (the "low SES" group; 38 subjects).


4.9.2 Models Used 

We compared our models to the results of the model used in [20] on the same data set (which
incorporates the data set used in [24]). The study [20] uses 5 guilt states {0, 0.25, 0.5, 0.75, 1}
compared to our 3, a planning horizon of 2 and an inverse temperature of 1; otherwise the
formal framework is exactly the same as in section 3.7. Action values in [20] were calculated by
an exact grid search over all possible histories and a numerical integration for the calculation of
the belief state. For comparison purposes we built a "clamped" model in which the planning
horizon was fixed at the value 2, with 3 guilt states and an inverse temperature set to β = 1/3.
Additionally, we compared to the outcome for the full method in this work, including estimation
of the planning horizon. We noted that in the analysis in [20], an additional approximation had
been made at the level 0 investor level, which set those investors as non-learning. This kept
their beliefs uniform and yielded much better negative log likelihoods within said model than
if they were learning.


4.9.3 Subject Fit 

A minimal requirement to accept subject results as significant is that the negative log likelihood
is significantly better than random on average at p < 0.05; otherwise we would not trust a
model-based analysis over random chance and the estimated parameters would be unreliable.
This criterion is numerically expressed as a negative log likelihood of 16.1 for 10 exchanges,
calculated from 5 possible actions at a probability of 0.2 each, with independent actions each
round.
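
The chance-level figure can be checked with a short calculation. The sketch below is only an
illustration: the function names are hypothetical, the model probabilities are made up, and it
reproduces the random baseline only, not the significance test itself.

```python
import math


def random_choice_nll(n_exchanges: int, n_actions: int) -> float:
    """Negative log likelihood of uniformly random, independent choices."""
    return -n_exchanges * math.log(1.0 / n_actions)


def negative_log_likelihood(chosen_probs):
    """NLL of the probabilities a model assigned to the observed choices."""
    return -sum(math.log(p) for p in chosen_probs)


# 10 exchanges with 5 equally likely actions each: 10 * ln(5).
baseline = random_choice_nll(n_exchanges=10, n_actions=5)
print(round(baseline, 1))  # 16.1

# Hypothetical model that puts probability 0.3 on each observed choice.
model_nll = negative_log_likelihood([0.3] * 10)
print(model_nll < baseline)  # True
```

Any model whose summed negative log likelihood over the 10 exchanges is not clearly below this
baseline explains the choices no better than uniform guessing.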

For the analysis in [20], we found that the special approximation made in [20] allowed for
significantly better negative log likelihoods (mean 11.98); if this approximation is removed, the
investor data fit at an inverse temperature of 1 would be worse than random for this data set.
Additionally, the model used in [20] did not fit the trustee data significantly better than random
at p < 0.05 (mean negative log likelihood 15.6, with a standard deviation of > 3).

Conversely, for both our clamped and full model analysis at β = 1/3, the trustee likelihood is
significantly better than random (11.7 at the full model) and the investor negative log likelihood
is slightly better on average (smaller) than found in [20] with 5 guilt states (11.7 for our method,
vs 11.98). This confirms that reducing the number of guilt states to 3 only reduces confusion
and does not worsen the fit of real subjects' data. Additionally, it becomes newly possible to
perform model-based analyses on the BPD trustee guilt state distribution, since the old model
did not fit trustees significantly better than random at p < 0.05.

The seemingly low inverse temperature of β = 1/3 is a consequence of the size of the rewards and
the quick accumulation of higher expectation values with more planning steps, as the inverse
temperature needs to counterbalance the expectation size to keep choices from becoming nearly
deterministic. Average investor reward expectations (at the first exchange) for planning 0 steps
stand at 18, with an average of 18 being added at each planning step.
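
The interplay between the scale of the expected values and the inverse temperature can be
illustrated with a small sketch. It assumes a standard softmax choice rule; the function name and
the action values are made up for illustration, and the lower inverse temperature (1/3) is used
here only as an example value.

```python
import math


def softmax(values, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    peak = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - peak)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up expected values for 5 actions, on the scale of tens of points.
values = [10.0, 14.0, 18.0, 22.0, 26.0]

for beta in (1.0, 1.0 / 3.0):
    probs = softmax(values, beta)
    print(beta, [round(p, 3) for p in probs])
```

With values of this size, an inverse temperature of 1 concentrates almost all probability on the
best action, whereas the lower value leaves noticeable probability on the alternatives. As
expected values grow with each additional planning step, the same effect becomes stronger.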


Figure 23: Parameter distributions for different models on the data set of [20], shown for the
impersonal control, personal control, BPD and low SES groups. (upper left) Investor ToM distribution
reported in Xiang et al.; the difference between the impersonal control condition and all other conditions
is significant (p < 0.05). (upper right) Trustee guilt distribution reported in Xiang et al.; the difference
between impersonal controls and the BPD trustees is significant. (middle left) Investor ToM distribution
with the planning horizon fixed at 2 and 3 guilt states; the BPD and low SES differences to impersonal
are significant. (middle right) Trustee guilt with the planning horizon fixed at 2; the difference between
BPD trustees and impersonal controls is significant. (bottom left) Full planning model investor ToM; all
differences to impersonal are significant. (bottom right) Full planning model trustee guilt; BPD trustees
are significantly different from controls. The asterisk denotes a significant (p < 0.05) difference in the
Kolmogorov-Smirnov two sample test, to the impersonal control group.
4.9.4 Marginal Parameter Distributions - Significant Features


Figure 23 shows the significant parameter distribution differences (Kolmogorov-Smirnov two
sample test, p < 0.05). For investor theory of mind and trustee guilt distribution, many of the
same differences are significant for the analysis reported in [20] (see Fig. 23 upper panels),
for an analysis using our model with a "clamped" planning horizon of 2 steps ahead (see Fig.
23 middle panels, to match the approach of [20]), and for our full model, using 3 guilt
states, ToM level up to 2 and 3 planning horizons (see Fig. 23 bottom panels and Fig. 24).


We find significantly lowered ToM in most other groups, compared to the impersonal control
group. We find a significantly lowered guilt distribution in BPD trustees; however, the guilt
difference was not used for fMRI analysis in [20] because, as noted above, the trustee was
not fit significantly better than random at p < 0.05 in the earlier model. For our full model
with 3 planning values, we find additional significant differences on the investor side: while
all ToM distributions are significantly different from the impersonal condition, the planning
difference between the personal and impersonal conditions is not significant at p < 0.05, whereas
it is significant for the other groups (see Fig. 24). Thus, this is the only model keeping the
parameter distribution of the personal group distinct from both the impersonal group (from
which it is not significantly different in the clamped model) and the low SES playing controls
and BPD playing controls (from which it is not significantly different based on the parameters
in [20]) at the same time.
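
As an illustration of the group comparison, the two-sample Kolmogorov-Smirnov statistic is the
largest gap between the empirical cumulative distributions of two samples. The sketch below is
a minimal pure-Python version with made-up ToM estimates for two hypothetical groups; the
function name is an assumption, and it computes the statistic only, not the p-value.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    largest_gap = 0.0
    for x in points:
        # Empirical CDF of each sample evaluated at x.
        cdf_a = sum(1 for v in sample_a if v <= x) / n_a
        cdf_b = sum(1 for v in sample_b if v <= x) / n_b
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up ToM level estimates (0, 1 or 2) for two hypothetical groups.
group_a = [2, 2, 2, 1, 2, 0, 2, 2]
group_b = [0, 0, 1, 0, 2, 0, 1, 0]

print(ks_two_sample_statistic(group_a, group_b))  # 0.625
```

In practice the p-value would be obtained from the distribution of this statistic under the null
hypothesis, for example with a library routine such as `scipy.stats.ks_2samp`.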


This supports the planning horizon as a "consistency of play" and additional rationality
measure, as the subjects do not think about possible partner deceptions as much in the personal
condition, having just met the person they will be playing (resulting in lowered ToM).
However, their play is non-disruptive, if low level, and consistent exchanges result. BPD and low
SES trustees, however, disrupt the partners' play, lowering their planning horizon.


Figure 24: Planning distribution for investors in the full model (planning parameter values 0, 2 and 7),
distinguished between personal condition controls (non-significant) and BPD and low SES trustees
(significantly lower than impersonal). The asterisk denotes a significant (p < 0.05) difference in the
Kolmogorov-Smirnov two sample test, to the impersonal control group.


5 Discussion


We adapted the Monte-Carlo tree search algorithm designed for partially observable Markov
decision processes iMl to the interactive, game-theoretic, case [23]. We provide significant
simplifications for the case of dyadic social exchange, which benefit any IPOMDP-based method. We
illustrated the power of this method by extending the computationally viable planning horizon
in a complex, multi-round, social exchange game to be able to encompass characteristic
behaviours that have been seen in human play IfTSlI .


We also showed that the 10 rounds that had been used empirically suffice to license high quality
inference about parameter values, at least in the case that the behaviour was generated from
the model itself. We exhibited three fundamental forms of dynamical behaviour in the task:
cooperation, and two different varieties of coaxing. The algorithm generates values, state-action
values and posterior beliefs, all of which can be used for such methods as model-based fMRI.


We find that the results in sections 4.4 and 4.6 and figure 24 confirm the planning horizon as a
consistency of play parameter that encodes the capability of a subject to execute a consistent
strategy throughout play. As such it may be disrupted by the behavior of shorter-planning
partners, as can be seen in figures 19 and 24.


Furthermore, comparing to earlier data used in the work [20], we can confirm the relevance of
the planning parameter in the treatment of real subject data, classifying subject groups along
the new axis of consistency of play.


The newly finer classification of subjects along the three axes of theory of mind, planning
horizon and guilt (k, P, α) should provide a rich framework to classify deficits in clinical
populations such as an inability to model other people's beliefs or intentions, ineffective model-based
reasoning, and a lack of empathy. Such analyses can be done at speed, of the order of 10s of
subjects per hour.


One might ask whether the behavioural patterns derived in this work might be obtained
without invoking the cognitive hierarchy and instead using a large enough state space, which
encodes the preferences and sophistication of the other agent as many separate states, rather than
a few type parameters plus the cognitive hierarchy. This is in principle possible; however, we
prefer ToM for 2 reasons. Firstly, the previous study [20] and others have found neural support
for the distinction between high ToM and low ToM subjects in real play, suggesting that this
distinction is not merely a mathematical convenience (cf. [20], p. 4 and 5 for a neural
representation of prediction errors associated with level 0 and level 2 thinking). Secondly, we can specify
features of interest, such as inequality aversion and planning at the lowest level, then generate
high level behaviours in a way that yields an immediate psychological interpretation in terms
of the mentalization steps encoded in the ToM level.


The algorithm opens the door to finer analysis of complicated social exchanges, possibly
allowing optimization over initial prior values in the estimation or the analysis of higher levels of
theory of mind, at least on tasks with lower fan-out in the search tree. It would also be possible
to search over the inverse temperature β.


One important lacuna is that although it is straightforward to use maximum likelihood to search
over fixed parameters (such as ToM level, planning horizon or indeed temperature), it is
radically harder to perform the computations that become necessary when these factors are
incorporated into the structure of the intentional models. That is, our subjects were assumed to make
inferences about their opponent's guilt, but not about their theory of mind level or planning
horizon.

It is possible that additional tricks would make this viable for the trust task, but it seems more 
promising to devise or exploit a simpler game in which this would be more straightforward. 